SPU: Update CELL Communication Performance #17646
Conversation
Force-pushed from a6a2fc0 to 9f8c31f.
Added "SPURS oriented thread waiting", which is going to replace the "Preferred SPU Threads" setting and be active by default.
Force-pushed from cafffb4 to 8d667a9.
constexpr u32 _1m = 1u << 20;
…
std::unique_lock fast_lock(render->sys_rsx_mtx, std::defer_lock);
I don't see the benefit of the double check under lock, especially since we expect that frame-to-frame the mappings won't actually change. Why not just lock and check once? I feel that would be faster here.
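For illustration, a minimal sketch of the two shapes being compared; mapping_changed and remap are hypothetical placeholders, not the actual RPCS3 code:

```cpp
#include <mutex>

// Hypothetical stand-ins for the real RPCS3 state; illustration only.
std::mutex sys_rsx_mtx;
bool mapping_changed() { return false; }
void remap() {}

// Pattern under review: unlocked check, then lock, then re-check.
void update_double_checked()
{
    if (!mapping_changed())
        return;                        // unlocked fast path
    std::lock_guard lock(sys_rsx_mtx);
    if (!mapping_changed())
        return;                        // re-validate under the lock
    remap();
}

// Suggested simplification: take the lock once and check once.
void update_single_check()
{
    std::lock_guard lock(sys_rsx_mtx);
    if (mapping_changed())
        remap();
}
```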
rpcs3/Emu/Cell/SPUThread.cpp (outdated)
break;
}
…
const u64 current = get_system_time();
I've pointed it out before, but get_system_time is unreasonably heavy. Prefer TSC unless real-world precise values are required.
A general note: the spu_info logic (test_and_update_atomic_op_info) is quite heavy-handed with all the atomic ops and may eat into performance. The biggest issue I see is that there is no fast path through this calling sequence (or the corresponding one below). Yes, SPURS itself is going to be almost always running task groups, but we also observe that in most games the parallel misses themselves aren't too bad on modern processors, though I agree we need something more sophisticated than the quick hack that was the preferred-threads option.
This is all theory, of course; we'll just have to see if it ends up worth the overhead with the big hitters like RDR, TLOU, or the Killzone titles.
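A rough sketch of the TSC suggestion; the helper names and the tick-budget idea are assumptions for illustration, not RPCS3's actual timing API:

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// One RDTSC instruction instead of an OS clock query plus conversion,
// which is what makes get_system_time comparatively heavy.
inline std::uint64_t cheap_ticks()
{
    return __rdtsc();
}

// Example: bound a wait by elapsed TSC ticks. The tick budget must be
// pre-calibrated against real time, since TSC frequency varies by CPU.
inline bool budget_elapsed(std::uint64_t start_ticks, std::uint64_t tick_budget)
{
    return __rdtsc() - start_ticks >= tick_budget;
}
```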
rpcs3/Emu/Cell/SPUThread.cpp (outdated)
spu_info[index].release(info);
…
for (usz i = 0; i < spu_info.size(); i++)
I think we can abuse vector ops for this sequence and gain implicit atomicity.
Have the spu info as a struct of arrays instead of an array of objects.
Then you can load all of them at once and (ab)use vector ops to figure out how much overlap there is.
On x86 at least, vector ops are atomic as long as they are naturally aligned, so we basically get that for free.
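A minimal sketch of the struct-of-arrays idea, assuming x86 SSE2 and C++20; the field name and the 8-entry size are made up for illustration:

```cpp
#include <bit>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Struct of arrays: the same field for all SPUs sits contiguously, so
// four 32-bit entries can be read with one naturally aligned 16-byte
// load (atomic in practice on current x86).
struct alignas(16) spu_info_soa
{
    std::uint32_t addr[8]{}; // hypothetical: per-SPU watched addresses
};

// Count how many of the first four entries overlap with `needle`
// using a single vector load plus a lane-wise compare.
inline int count_overlap4(const spu_info_soa& info, std::uint32_t needle)
{
    const __m128i lane = _mm_load_si128(reinterpret_cast<const __m128i*>(info.addr));
    const __m128i eq   = _mm_cmpeq_epi32(lane, _mm_set1_epi32(static_cast<int>(needle)));
    const unsigned hits = _mm_movemask_ps(_mm_castsi128_ps(eq)); // 1 bit per lane
    return std::popcount(hits);
}
```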
I've pushed an experimental update, please test. If it works, I'll put it under a special setting.

i9-13900K | RTX 3080

KZ3 is still very bad (40 fps vs 69). GPU usage in particular is very low. EDIT: also attached the log.
std::lock_guard lock(g_camera.mutex);
…
*info = g_camera.info;
CellCameraInfoEx info_out;
Why is this change needed? There are zero comments explaining it.
We shouldn't touch VM memory under a mutex, for a few reasons (RSX access violations lengthen the duration of the lock, for example).
We can put it in the coding guidelines; there is no need to comment it each time.
Then a wrapper construct makes more sense; otherwise this will just be repeated elsewhere. Or maybe an unlocked probe_for_read / probe_for_write makes more sense, as is usually done in real drivers.
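For reference, a sketch of the snapshot-then-write shape the diff above appears to follow; the types are simplified stand-ins, not the real cellCamera code:

```cpp
#include <mutex>

struct CellCameraInfoEx { int width, height; }; // simplified stand-in

struct camera_state
{
    std::mutex mutex;
    CellCameraInfoEx info{};
} g_camera;

void get_info(CellCameraInfoEx* info) // `info` points into guest (VM) memory
{
    CellCameraInfoEx info_out;
    {
        std::lock_guard lock(g_camera.mutex);
        info_out = g_camera.info; // copy under lock, no VM access here
    }
    // The guest write happens outside the lock, so a page fault or RSX
    // access violation cannot lengthen the critical section.
    *info = info_out;
}
```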
Optimizations:
- Use a reader lock in sys_memory_get_page_attribute internally (sketched below). The writer lock in sys_memory_get_page_attribute was causing SPUs to wait unjustly.
- Optimize locking in sys_rsx_context_iomap.
- Free the spu_thread::reservation_check address receptacle from writer_lock detection and waiting.
- Add a spu_thread::reservation_check(hash) overload for main and stack memory.
- Skip spu_thread::reservation_check when the address is on the same page as GETLLAR's effective address.

Fixes #14724
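A generic sketch of the reader-lock change described in the first point, using std::shared_mutex rather than RPCS3's actual lock types; lookup_page is a dummy placeholder:

```cpp
#include <shared_mutex>

struct page_attr { unsigned flags; };

std::shared_mutex page_map_mutex;                                 // stands in for the internal memory lock
page_attr lookup_page(unsigned addr) { return { addr & 0xfff }; } // dummy query

// A read-only query such as sys_memory_get_page_attribute only needs a
// shared (reader) lock: many SPU threads can query concurrently instead
// of serializing behind an exclusive (writer) lock.
page_attr get_page_attribute(unsigned addr)
{
    std::shared_lock lock(page_map_mutex);
    return lookup_page(addr);
}
```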
sys_memory_get_page_attributewas causing SPUs to wait unjustly.sys_rsx_context_iomap.spu_thread::reservation_checkaddress receptacle from writer_lock detection and waiting.spu_thread::reservation_check(hash)overload for main and stack memory.spu_thread::reservation_checkwhen the address is on the same page asGETLLAR's effective address.Fixes #14724